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ABSTRACT 

Materialized views (MVs), stored pre-computed results, are widely 
used to facilitate fast queries on large datasets. When new records 
arrive at a high rate, it is infeasible to continuously update (main¬ 
tain) MVs and a common solution is to defer maintenance by batch¬ 
ing updates together. Between batches the MVs become increas¬ 
ingly stale with incorrect, missing, and superfluous rows leading 
to increasingly inaccurate query results. We propose Stale View 
Cleaning (SVC) which addresses this problem from a data clean¬ 
ing perspective. In SVC, we efficiently clean a sample of rows from 
a stale MV, and use the clean sample to estimate aggregate query 
results. While approximate, the estimated query results reflect the 
most recent data. As sampling can be sensitive to long-tailed dis¬ 
tributions, we further explore an outlier indexing technique to give 
increased accuracy when the data distributions are skewed. SVC 
complements existing deferred maintenance approaches by giving 
accurate and bounded query answers between maintenance. We 
evaluate our method on a generated dataset from the TPC-D bench¬ 
mark and a real video distribution application. Experiments con¬ 
firm our theoretical results: (1) cleaning an MV sample is more 
efficient than full view maintenance, (2) the estimated results are 
more accurate than using the stale MV, and (3) SVC is applicable 
for a wide variety of MVs. 

1. INTRODUCTION 

Storing pre-computed query results, also known as materializa¬ 
tion, is an extensively studied approach to reduce query latency 
on large data |9|22|31) . Materialized Views (MVs) are now sup¬ 
ported by all major commercial vendors. However, as with any pre- 
computation or caching, the key challenge in using MVs is main¬ 
taining their freshness as base data changes. While there has been 
substantial work in incremental maintenance of MVs |9|27| , ea¬ 
ger maintenance (i.e., immediately applying updates) is not always 
feasible. 

In applications such as monitoring or visualization |35|49| , ana¬ 
lysts may create many MVs by slicing or aggregating over different 
dimensions. Eager maintenance requires updating all affected MVs 
for every incoming transaction, and thus, each additional MV re- 
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Figure 1: In SVC, we pose view maintenance as a sample-and- 
clean problem and show that we can use a sample of clean (up- 
to-date) rows from an MV to correct inaccurate query results 
on stale views. 

duces the available transaction throughput. This problem becomes 
significantly harder when the views are distributed and computa¬ 
tional resources are contended by other tasks. As a result, in pro¬ 
duction environments, it is common to batch updates together to 
amortize overheads j^. Batch sizes are set according to system 
constraints, and can vary from a few seconds to even nightly. 

While increasing the batching period gives the user more flexi¬ 
bility to schedule around system constraints, a disadvantage is that 
MVs are stale between maintenance periods. Other than an edu¬ 
cated guess based on past data, the user has no way of knowing 
how incorrect their query results are. Some types of views and 
query workloads can be sensitive to even a small number of base 
data updates, for example, if updates disproportionately affect a 
subset of frequently queried rows. Thus, any amount of staleness 
is potentially dangerous, and this presents us a dichotomy between 
facing the cost of eager maintenance or coping with consequences 
of unknown inaccuracy. In this paper, we explore an intriguing 
middle ground, namely, we can derive a bounded approximation of 
the correct answer for a fraction of the cost. With a small amount 
of up-to-date data, we can compensate for the error in aggregate 
query results induced by staleness. 

Our method relies on modeling query answering on stale MVs 
as a data cleaning problem. A stale MV has incorrect, missing, or 
superfluous rows, which are problems that have been studied in the 
data cleaning literature (e.g., see Rahm and Do for a survey |[45l). 
Increasing data volumes have led to development of new, efficient 
sampling-based approaches for coping with dirty data. In our prior 
work, we developed the SampleClean framework for scalable ag¬ 
gregate query processing on dirty data | [48) . Since data cleaning 
is often expensive, we proposed cleaning a sample of data and us¬ 
ing this sample to improve the results of aggregate queries on the 
full dataset. Since stale MVs are dirty data, an approach similar to 
SampleClean raises a new possibility of using a sample of “clean” 
rows in the MVs to return more accurate query results. 












Stale View Cleaning (SVC illustrated in Figure approximates 
aggregate query results from a stale MV and a small sample of up- 
to-date data. We calculate a relational expression that materializes 
a uniform sample of up-to-date rows. This expression can be in¬ 
terpreted as “cleaning” a stale sample of rows. We use the clean 
sample of rows to estimate a result for an aggregate query on the 
view. The estimates from this procedure, while approximate, re¬ 
flect the most recent data. Approximation error is more manageable 
than staleness because: (1) the uniformity of sampling allows us to 
apply theory from statistics such as the Central Limit Theorem to 
give tight bounds on approximate results, and (2) the approximate 
error is parametrized by the sample size which the user can control 
trading off accuracy for computation. However, the MV setting 
presents new challenges that we did not consider in prior work. To 
summarize our contributions:(1) a hashing-based technique that ef- 
flciently materializes an up-to-date sample view, (2) algorithms for 
processing general aggregate queries on a sample view and bound¬ 
ing results in confldence intervals, (3) an outlier indexing technique 
to reduce sensitivity to skewed datasets that can push the index up 
to derived relations, and (4) an evaluation of this technique on real 
and synthetic datasets to show that SVC gives highly accurate re¬ 
sults for a relatively small maintenance cost. 

The paper is organized as follows: In Section we give the 
necessary background for our work. Next, in Section we for¬ 
malize the problem. In Sections andwe describe the sampling 
and query processing of our technique. In Section we describe 
the outlier indexing framework. Then, in Section [71 we evaluate 
our approach. We discuss Related Work in Section^ Finally, we 
present our Conclusions in Section 

2. BACKGROUND 

2.1 Motivation and Example 

Materialized view maintenance can be very expensive resulting 
in staleness. Many important use-cases require creating a large 
number of views including: visualization, personalization, privacy, 
and real-time monitoring. The problem with eager maintenance is 
that every view created by an analyst places a bottleneck on in¬ 
coming transactions. There has been signiflcant research on fast 
MV maintenance algorithms, most recently DBToaster (27) which 
uses SQL query compilation and higher-order maintenance. How¬ 
ever, even with these optimizations, some materialized views are 
computationally difficult to incrementally maintain. For example, 
incremental maintenance of views with correlated subqueries can 
grow with the size of the data. It is also common to use the same 
infrastructure to maintain multiple MVs (along with other analyt¬ 
ics tasks) adding further contention to computational resources and 
reducing overall available throughput. When faced with such chal¬ 
lenges, one solution is to batch updates and amortize maintenance 
overheads. 

Log Analysis Example: Suppose we are a video streaming com¬ 
pany analyzing user engagement. Our database consists of two ta¬ 
bles Log and Video, with the following schema: 

Log(sessionid, videold) 

Video ( videold, ownerld, duration) 

The Log table stores each visit to a speciflc video with primary key 
(sessionid) and a foreign-key to the Video table (videold). 
For our analysis, we are interested in flnding aggregate statistics 
on visits, such as the average visits per video and the total num¬ 
ber of visits predicated on different subsets of owners. We could 
deflne the following MV that counts the visits for each videold 
associated with owners and the duration. 


CREATE VIEW visitView 

AS SELECT videold , ownerld , duration , 

count(l) as visitCount 

FROM Log, Video WHERE Log. videold = Video . videold 
GROUP BY videold 

As Log table grows, this MV becomes stale, and we denote the 
insertions to the table as: 

Logins(sessionid, videold) 

Staleness does not affect every query uniformly. Even when the 
number of new entries in Logins is small relative to Log, some 
queries might be very inaccurate. For example, views to newly 
added videos may account for most of Logins, so queries that 
count visits to the most recent videos will be more inaccurate. The 
amount of inaccuracy is unknown to the user, who can only esti¬ 
mate an expected error based on prior experience. This assumption 
may not hold in rapidly evolving data. We see an opportunity for 
approximation through sampling which can give bounded query 
results for a reduced maintenance cost. In other words, a small 
amount of up-to-date data allows the user to estimate the magni¬ 
tude of query result error due to staleness. 

2.2 SampleClean | |48| | 

SampleClean is a framework for scalable aggregate query pro¬ 
cessing on dirty data. Traditionally, data cleaning has explored ex¬ 
pensive, up-front cleaning of entire datasets for increased query ac¬ 
curacy. Those who were unwilling to pay the full cleaning cost 
avoided data cleaning altogether. We proposed SampleClean to 
add an additional trade-off to this design space by using sampling, 
i.e., bounded results for aggregate queries when only a sample of 
data is cleaned. The problem of high computational costs for ac¬ 
curate results mirrors the challenge faced in the MV setting with 
the tradeoff between immediate maintenance (expensive and up- 
to-date) and deferred maintenance (inexpensive and stale). Thus, 
we explore how samples of “clean” (up-to-date) data can be used 
for improved query processing on MVs without incurring the full 
cost of maintenance. 

However, the metaphor of stale MVs as a Sample-and-Clean 
problem only goes so far and there are signiflcant new challenges 
that we address in this paper. In prior work, we modeled data clean¬ 
ing as a row-by-row black-box transformation. This model does not 
work for missing and superfluous rows in stale MVs. In particular, 
our sampling method has to account for this issue and we propose a 
hashing based technique to efficiently materialize a uniform sample 
even in the presence of missing/superfluous rows. Next, we greatly 
expand the query processing scope of SampleClean beyond sum, 
count, and avg queries. Bounding estimates that are not sum, 
count, and avg queries, is signiflcantly more complicated. This 
requires new analytic tools such as a statistical bootstrap estima¬ 
tion to calculate confldence intervals. Finally, we add an outlier 
indexing technique to improve estimates on skewed data. 

3. FRAMEWORK OVERVIEW 
3.1 Notation and Definitions 

SVC returns a bounded approximation for aggregate queries on 
stale MVs for a flexible additional maintenance cost. 

Materialized View: Let P be a database which is a collection of 
relations A materialized view S is the result of applying a 

view definition to V. View deflnitions are composed of standard 
relational algebra expressions: Select (cr^^). Project (H), Join (m). 
Aggregation (7), Union (U), Intersection (n) and Difference (—). 
We use the following parametrized notation for joins, aggregations 
and generalized projections: 


• nai,a 2 ,...,afc(^): Generalized projection selects attributes 
{ai,a 2 ,...,afc} from R, allowing for adding new attributes 
that are arithmetic transformations of old ones (e.g., ai +a 2 ). 

• ^(/)(ri,r 2 ) {Ri,R 2 )- Join sclccts all tuples in Ri x R 2 that 
satisfy 0(ri, r 2 ). We use M to denote all types of joins even 
extended outer joins such as D>c, ix, 

• Apply the aggregate function / to the relation R 
grouped by the distinct values of A, where A is a subset of 
the attributes. The DISTINCT operation can be considered 
as a special case of the Aggregation operation. 

The composition of the unary and binary relational expressions can 
be represented as a tree, which is called the expression tree. The 
leaves of the tree are the base relations for the view. Each non¬ 
leave node is the result of applying one of the above relational ex¬ 
pressions to a relation. To avoid ambiguity, we refer to tuples of the 
base relations as records and tuples of derived relations as rows. 

Primary Key: We assume that each of the base relations has a pri¬ 
mary key. If this is not the case, we can always add an extra column 
that assigns an increasing sequence of integers to each record. For 
the defined relational expressions, every row in a materialized view 
can also be given a primary key |14|52| , which we will describe 
in Section This primary key is formally a subset of attributes 
u C {ai, a 2 ,Ufc} such that all s ^ S{u) are unique. 

Staleness: For each relation Ri there is a set of insertions /S.Ri 
(modeled as a relation) and a set of deletions VRi. An “update” 
to Ri can be modeled as a deletion and then an insertion. We refer 
to the set of insertion and deletion relations as “delta relations”, 
denoted by dV\ 

dV = ..., Ai?4 U {VRi ,..., VRk} 

A view S is considered stale when there exist insertions or dele¬ 
tions to any of its base relations. This means that at least one of the 
delta relations in dV is non-empty. 

Maintenance: There may be multiple ways (e.g., incremental main¬ 
tenance or recomputation) to maintain a view S, and we denote the 
up-to-date view as S'. We formalize the procedure to maintain the 
view as a maintenance strategy M. A maintenance strategy is a 
relational expression the execution of which will return S'. It is a 
function of the database V, the stale view S, and all the insertion 
and deletion relations dV. In this work, we consider maintenance 
strategies composed of the same relational expressions as material¬ 
ized views described above. 

S' ^ M{S,V,dD) 

Staleness as Data Error: The consequences of staleness are in¬ 
correct, missing, and superfiuous rows. Formally, for a stale view 
S with primary key u and an up-to-date view S'’. 

• Incorrect: Incorrect rows are the set of rows (identified by 
the primary key) that are updated in S'. For s E S, let s(u) 
be the value of the primary key. An incorrect row is one such 
that there exists sl s' ^ S' with s'(u) = s{u) and s ^ s'. 

• Missing: Missing rows are the set of rows (identified by the 
primary key) that exist in the up-to-date view but not in the 
stale view. For s' ^ S', let s'{u) be the value of the primary 
key. A missing row is one such that there does not exist a 
s ^ S with s{u) = s'(u). 

• Superfluous: Superfluous rows are the set of rows (identi¬ 
fied by the primary key) that exist in the stale view but not in 
the up-to-date view. For s E S, let s(u) be the value of the 
primary key. A superfiuous row is one such that there does 
not exist sl s' E S' with s(u) = s'(u). 


Uniform Random Sampling: We define a sampling ratio m E 
[0,1] and for each row in a view S, we include it into a sample 
with probability m. We use the “hat” notation (e.g., S) to denote 
sampled relations. The relation § is a uniform sample of S if 

m^s eS ’. s E S’, (2) Pr(si eS)^ Pr{s 2 E S) = m. 

We say a sample is clean if and only if it is a uniform random 
sample of the up-to-date view S'. 

Example 1. In this example, we summarize all of the key con¬ 
cepts and terminology pertaining to materialized views, stale data 
error, and maintenance strategies. Our example view, visitView, 
joins the Log table with the Video table and counts the visits for 
each video grouped by videold. Since there is a foreign key rela¬ 
tionship between the relations, this is just a visit count for each 
unique video with additional attributes. The primary keys of the 
base relations are: sessionid for Log and videold for Video. 

If new records have been added to the Log table, the visitView is 
considered stale. Incorrect rows in the view are videos for which 
the visitCount is incorrect and missing rows are videos that had 
not yet been viewed once at the time of materialization. While not 
possible in our running example, superfluous rows would be videos 
whose Log records have all been deleted. Formally, in this example 
our database is T> — {Video, Logj, and the delta relations are 
dV = {Logins}. 

Suppose, we apply the change-table IVM algorithm proposed 
in 

1. Create a “delta view” by applying the view definition to Lo¬ 
gins. That is, calculate the visit count per video on the new 
logs: 

yiVideo M Logins) 

2. Take the full outer join of the “delta view ” with the stale view 
visitView (equality on videold). 

VisitView ^ y{Video M Logins) 

3. Apply the generalized projection operator to add the visit¬ 
Count in the delta view to each of the rows in visitView where 
we treat a NULL value as 0: 

liiyisitView l><L^(Video M Logins)) 
Therefore, the maintenance strategy is: 

M{{VisitView}, {Video, Log}, {Logins}) 

= Il{VisitView JXLy(Video M Logins)) 

3.2 SVC Workflow 

Formally, the workfiow of SVC is: 

1. We are given a view S. 

2. M defines the maintenance strategy that updates S at each 
maintenance period. 

3. The view S is stale between periodic maintenance, and the 
up-to-date view should be S'. 

4. (Problem 1. Stale Sample View Cleaning) We find an expres¬ 
sion C derived from M that cleans a uniform random sample 
of the stale view S to produce a “clean” sample of the up-to- 
date view S'. 

5. (Problem 2. Query Result Estimation) Given an aggregate 
query q and the state query result q{S), we use S' and S to 
estimate the up-to-date result. 

6. We optionally maintain an index of outliers o for improved 
estimation in skewed data. 




Stale Sample View Cleaning: The first problem addressed in this 
paper is how to clean a sample of the stale materialized view. 

Problem 1 (Stale Sample View Clewing). We are 
given a stale view S, a sample of this stale view S with ratio m, the 
maintenance strategy M, the base relations V, and the insertion 
and deletion relations dT>. We want to find a relational expression 
C such that: 


VisitView(videold, 

ownerld, Y 
duration, a 
visitCount) 


M 


/ \ 


R ( (sessionid, videold), 
ownerld, 
duration) 


Log(sessionld, 

videold) 


Video(videold, 
ownerld, 
duration) 


S' = C{S,V,dV), 

where S' is a sample of the up-to-date view with ratio m. 

Query Result Estimation: The second problem addressed in this 
paper is query result estimation. 

Problem 2 (Query Result Estimation). Letqbeanag- 
gregate query of the following form^ 

SELECT agg{a) EROM View WHERE Condition (A); 

If the view S is stale, then the result will be incorrect by some 
value c: 

l{s') = q{S) + c 

Our objective is to find an estimator f such that: 

q{S') ^ f{q{S),S,S') 

Example 2. Suppose a user wants to know how many videos 
have received more than 100 views. 

SELECT COUNT(1) EROM visitView WHERE visitCount > 100; 


Figure 2: Applying the rules described in Definition|^ we illus¬ 
trate how to assign a primary key to a view. 

SELECT videold , count (1) EROM Log 
GROUP BY videold 

The resulting view has one row for every distinct videold. We 
want to materialize a sample of S', that is a sample of distinct 
videold. If we sample the base relation Log first, we do not 
get a sample of the view. Instead, we get a view where every count 
is partial. 

To achieve a sample of S', we need to ensure that for each s G 
S' all contributing rows in subexpressions to s are also sampled. 
This is a problem of row provenance (E) Provenance, also termed 
lineage, has been an important tool in the analysis of materialized 
views (E) and in approximate query processing (5^ . 

Deeinition 1 (Provenance). Let r be a row in relation R, 
let R be derived from some other relation R = expifJ) where 
exp[‘) be a relational expression composed of the expressions de¬ 
fined in Section \3.1\ The provenance of row r with respect to U 
is pu{r). This is defined as the set of rows in U such that for an 
update to any row u 0 puip), it guarantees that r is unchanged. 


Let us suppose the user runs the query and the result is 45. How¬ 
ever, there have now been new records inserted into the Log ta¬ 
ble making this result stale. First, we take a sample of visitView 
and suppose this sample is a 5% sample. In Stale Sample View 
Cleaning (Problem 1), we apply updates, insertions, and deletions 
to the sample to efficiently materialize a 5% sample of the up-to- 
date view. In Query Result Estimation (Problem 2), we estimate 
aggregate query results based on the stale sample and the up-to- 
date sample. 

4 . EFFICIENTLY CLEANING A SAMPLE 

In this section, we describe how to find a relational expression C 
derived from the maintenance strategy A4 that efficiently cleans a 
sample of a stale view S to produce S'. 

4.1 Challenges 

To setup the problem, we first consider two naive solutions to 
this problem that will not work. We could trivially apply A4 to the 
entire stale view S and update it to S', and then sample. While 
the result is correct according to our problem formulation, it does 
not save us on any computation for maintenance. We want to avoid 
materialization of up-to-date rows outside of the sample. However, 
the naive alternative solution is also flawed. Eor example, we could 
just apply M to the stale sample S and a sample of the delta rela¬ 
tions dT>. The challenge is that A4 does not always commute with 
sampling. 

4.2 Provenance 

To understand the commutativity problem, consider maintaining 
a group by aggregate view: 

^ For simplicity, we exclude the group by clause for all queries in the paper, as it can 
be modeled as part of the Condition. 


4.3 Primary Keys 

Eor the relational expressions defined in the previous sections, 
this provenance is well defined and can be tracked using primary 
key rules that are enforced on each subexpression (E). We recur¬ 
sively define a set of primary keys for all relations in the expression 
tree: 

Definition 2 (Primary Key Generation). For every re¬ 
lational expression R, we define the primary key attribute(s) of ev¬ 
ery expression to be: 

• Base Case: All relations (leaves) must have an attribute p 
which is designated as a primary key. 

• cF(f)(R): Primary key of the result is the primary key ofR 

• Il(^ai,...,ak){R): Primary key of the result is the primary key 
ofR. The primary key must always be included in the projec¬ 
tion. 

• ^(/)(ri,r 2 ) (^ 15 ^ 2 ).' Primary key of the result is the tuple of 
the primary keys of Ri and R 2 . 

• 7/, A (^).' The primary key of the result is the group by key A 
(which may be a set of attributes). 

• Ri UR 2 : Primary key of the result is the union of the primary 
keys of Ri and R 2 

• n R 2 : Primary key of the result is the intersection of the 
primary keys of Ri and R 2 

• Ri — R 2 : Primary key of the result is the primary key of Ri 

For every node at the expression tree, these keys are guaranteed to 
uniquely identify a row. 

These rules define a constructive definition that can always be ap¬ 
plied for our defined relational expressions. 

Example 3. A variant of our running example view that does 
not have a primary key is: 



CREATE VIEW visitView AS SELECT count(l) as visitCount 
FROM Log, Video WHERE Log. videold = Video . vide old 
GROUP BY videold 

We illustrate the key generation process in Figure^ Suppose there 
is a base relation, such as Log, that is missing a primary key (ses- 
We can add this attribute by generating an increasing 
sequence of integers for each record in Log. Since both base tables 
Video and Log have primary keys videold and sessionid respec¬ 
tively, the result of the join will have a primary key (videold, ses¬ 
sionid). Since the group by attribute is videold, that becomes the 
primary key of the view. 

4.4 Hashing Operator 

The primary keys allow us to determine the set of rows that con¬ 
tribute to a row r in a derived relation. If we have a deterministic 
way of mapping a primary key to a Boolean, we can ensure that all 
contributing rows are also sampled. To achieve this we use a hash¬ 
ing procedure. Let us denote the hashing operator ria,m{R)- For 
all tuples in R, this operator applies a hash function whose range 
is [0,1] to primary key a (which may be a set) and selects those 
records with hash less than or equal to m|^ 

In this work, we study uniform hashing where the condition 
h{a) < m implies that a fraction of approximately m of the rows 
are sampled. Such hash functions are utilized in other aspects of 
database research and practice (e.g. hash partitioning, hash joins, 
and hash tables). Hash functions in these applications are designed 
to be as uniform as possible to avoid collisions. Numerous em¬ 
pirical studies establish that many commonly applied hash func¬ 
tions (e.g.. Linear, SDBM, MD5, SHA) have negligible differences 
with a true uniform random variable (25|32| . Cryptographic hashes 
work particularly well and are supported by most commercial and 
open source systems, for example MySQL provides MD5 and SHAl 

To avoid materializing extra rows, we push down the hashing 
operator through the expression tree. The further that we can push 
r] down, the more operators (i.e., above the sampling) can benefit. 
This push-down is analogous to predicate push-down operations 
used in query optimizers. In particular, we are interested in find¬ 
ing an optimized relational expression that materializes an identi¬ 
cal sample before and after the push-down. We formalize the push¬ 
down rules below: 

Definition 3 (Hash push-down). For a derived relation 
R, the following rules can be applied to push r]a,m{R) down the 
expression tree. 

• (Tct){R): Push r] through the expression. 

• Il(^ai,...,ak){R)’ Push p through if a is in the projection. 

• {Ri^R 2 ): No push down in general. There are 
special cases below where push down is possible. 

• Push p through if a is in the group by clause A. 

• RiU R 2 : Push p through to both Ri and R 2 

• Rif] R 2 : Push p through to both Ri and R 2 

• Ri — R 2 : Push p through to both Ri and R 2 

Special Case of Joins: In general, a join R[x^ S blocks the push¬ 
down of the hash operator pa,m{R) since a possibly consists of 
attributes in both R and S. However, when there is a constraint 
that enforces these attributes are equal then push-down is possible. 

Foreign Key Join. If we have a join with two foreign-key rela¬ 
tions Ri (fact table with foreign key a) and R 2 (dimension table 

2 

It does not make sense for Video to be missing a primary key in our running example 
due to the foreign key relationship 

3 

For example, if hash function is a 32-bit unsigned integer which we can normalize 
by MAXINT to be in [0, 1]. 


with primary key b C a) and we are sampling the key a, then we 
can push the sampling down to Ri. This is because we are guaran¬ 
teed that for every n G Ri there is only one r 2 G i^ 2 . 

Equality Join. If the join is an equality join and a is one of the 
attributes in the equality join condition Ri .a = R 2 .b, then p can be 
pushed down to both Ri and R 2 . On Ri the pushed down operator 
is Pa,m{Ri) and on R 2 the operator is pb,m{R 2 )- 

Example 4. We illustrate our hashing procedure in terms of 
SQL expressions on our running example. We can push down the 
hash function for the following expressions: 

SELECT * FROM Video WHERE Condition(-) 

SELECT * FROM Video,Log WHERE Video . videold = Log. videold 
SELECT videold, count (1) FROM Log GROUP BY videold 

The following expressions are examples where we cannot push¬ 
down the hash function: 

SELECT * FROM Video , Log 

SELECT c, count(1) 

FROM ( 

SELECT videold , count(l) as c FROM Log 
GROUP BY videold 

) 

GROUP BY c 

In Theorem[^ we prove the correctness of our push-down rules. 

Theorem 1. Given a derived relation R, primary key a, and 
the sample Pa,m{R)- Let S be the sample created by applying pa,m 
without push-down and S' be the sample created by applying the 
push-down rules to pa,m{R)- S and S' are identical samples with 
sampling ratio m. 

Proof Sketch. We can prove this by induction. The base 
case is where the expression tree is only one node, trivially making 
this true. Then, we can induct considering one level of operators in 
the tree, cr, U, n, — clearly commutes with hashing a. H commutes 
only if a is in the projection. For M, a sampling operator on Q can 
be pushed down if a is in either kr or kg, or if there is a constraint 
that links kr to kg. For group by aggregates, if a is in the group 
clause (i.e., it is in the aggregate), then hashing the operand filters 
all rows that have a which is sufficient to materialize the derived 
row. □ 

4.5 Efficient View Cleaning 

If we apply the hashing operator to AT, we can get an optimized 
cleaning expression C that avoids materializing unnecessary rows. 
When applied to a stale sample of a view S, the database V, and the 
delta relations dV, it produces an up-to-date sample with sampling 
ratio m: 

S' ^c(s,v,dv) 

Thus, it addresses Problem 1 from the previous section. 

Example 5. We illustrate our proposed approach on our ex¬ 
ample view visitView with the expression tree listed in Figure 
We start by applying the hashing operator to the primary key 
(videold). The next operator we see in the expression tree is a 
projection that increments the visitCount in the view, and this 
allows for push-down since primary key is in the projection. The 
second expression is a hash of the equality join key which merges 
the aggregate from the “delta view ” to the old view allowing us 
to push down on both branches of the tree using our special case 
for equality joins. On the left side, we reach the stale view so we 
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Figure 3: Applying the rules described in Section [414} we illus¬ 
trate how to optimize the sampling of our example maintenance 
strategy. 

stop. On the right side, we reach the aggregate query (count) and 
since the primary key is in group by clause, we can push down the 
sampling. Then, we reach another point where we hash the equal¬ 
ity join key allowing us to push down the sampling to the relations 

Logins and Video. 

4.6 Corresponding Samples 

We Started with a uniform random sample S of the stale view 
S. The hash push down allows us to efficiently materialize the 
sample S'. S' is a uniform random sample of the up-to-date view 
S. While both of these samples are uniform random samples of 
their respective relations, the two samples are correlated since S' 
is generated by cleaning S. In particular, our hashing technique 
ensures that the primary keys in S' depend on the primary keys in 
S. Statistically, this positively correlates the query result q{S') and 
q{S). We will see how this property can be leveraged to improve 
query estimation accuracy (Section [5T] ). 

Property 1 (Correspondence). Suppose S' and S are 
uniform samples of S' and S, respectively. Let u denote the pri¬ 
mary key. We say S' and S correspond if and only if: 


• Uniformity: S' and S are uniform random samples of S' and 
S respectively with a sampling ratio ofm 

• Removal of Superfluous Rows: D = {Vs ^ S $s' ^ S' : 
s(u) = s'(u)}, D C] S' — ^ 

• Sampling of Missing Rows: I — {Vs' ^ S'^ S '. s{u) — 

s'(u)}, E(| I) = ^ I ^ I 

• Key Preservation for Updated Rows: For all s E S and not 
in D or I, s' E S' : s'{u) = s{u). 


5. QUERY RESULT ESTIMATION 

SVC returns two corresponding samples, S and S'. § is a “dirty” 
sample (sample of the stale view) and S' is a “clean” sample (sam¬ 
ple of the up-to-date view). In this section, we first discuss how to 
estimate query results using the two corresponding samples. Then, 
we discuss the bounds and guarantees on different classes of aggre¬ 
gate queries. 

5.1 Result Estimation 

Suppose, we have an aggregate query q of the following form: 
q(View) := SELECT f(attr) ERCM View WEIERE cond(*) 


We quantify the staleness c of the aggregate query result as the 
difference between the query applied to the stale view S compared 
to the up-to-date view S': 

lis') = q{S)+c 


The objective of this work is to estimate q{S'). In the Approx- 
imate Query Processing (AQP) literature, sample-based estimates 
have been well studied m This inspires our first estimation al¬ 
gorithm, SVC-i-AQP, which uses SVC to materialize a sample view 
and an AQP-style result estimation technique. 

SVC+AQP: Given a clean sample view S', the query q, and a 
scaling factor s, we apply the query to the sample and scale it by s: 

q{S') X s ■ q{S') 


For example, for the sum and count the scaling factor is For 
the avg the scaling factor is 1. Refer to m for a detailed dis¬ 
cussion on the scaling factors. 

SVC-i-AQP returns what we call a direct estimate of ^^(5"). We 
could, however, try to estimate c instead. Since we have the stale 
view S, we could run the query q on the full stale view and es¬ 
timate the difference c using the samples S and S'. We call this 
approach SVC-i-CORR, which represents calculating a correction 
io q{S) instead of a direct estimate. 

SVC-hCORR: Given a clean sample S', its corresponding dirty 
sample S, a query q, and a scaling factor s: 


1. Apply SVC-I-AQP to S': restjresh = s • ^(^S") 

2. Apply SVC-I-AQP to S: Test.stale = S • q{S) 

3. Apply q to the full stale view: rstale = q{S) 

4. Take the difference between (1) and (2) and add it to (3): 

q(S ) 'f'stale {pest-fresh '^eststaled 


A commonly studied property in the AQP literature is unbiased¬ 
ness. An unbiased result estimate means that the expected value 
of the estimate over all samples of the same size is g(*S") We 
can prove that if SVC-i-AQP is unbiased (there is an AQP method 
that gives an unbiased result) then SVC-hCORR also gives unbiased 
results. 


Lemma 1 . If there exists an unbiased sample estimator for q( S') 
then there exists an unbiased sample estimator for c. 

Prooe Sketch. Suppose, we have an unbiased sample esti¬ 
mator Cq of q. Then, it follows that E[eg(5'')] = ^(*S") If we 
substitute in this expression: c = E[eg(AS')] — q{S). Applying the 
linearity of expectation: c = E[eg(*S'') — q{S)] □ 

Some queries do not have unbiased sample estimators, but the bias 
of their sample estimators can be bounded. Example queries in¬ 
clude: median, percentile. A corollary to the previous lemma, 
is that if we can bound the bias for our estimator then we can 
achieve a bounded bias for c as well. 


Example 6. We can formalize our earlier example query in 
Section^in terms of SVC-\-CORR and SVC-\-AQP. Let us suppose 
the initial query result is 45. There now have been new log records 
inserted into the Log table making the old result stale, and suppose 
we are working with a sampling ratio of 5%. For SVC-\-AQP, we 
count the number of videos in the clean sample that currently have 
counts greater than 100 and scale that result by = 20. If the 
count from the clean sample is 4, then the estimate for SVCAQP is 

4 

The avg query is considered conditionally unbiased in some works. 









80. For SVC+CORR, we also run SVC+AQP on the dirty sample. 
Suppose that there are only two videos in the dirty sample with 
counts above 100, then the result of running SVC+AQP on the dirty 
sample is 20 • 2 = 40. We take the difference of the two values 
80 — 40 = 40. This means that we should correct the old result by 
40 resulting in the estimate 6>/45 + 40 = 85. 

5.2 Confidence Intervals 

To bound our estimates in confidence intervals we explore three 
cases: ( 1 ) aggregates that can be written as sample means, ( 2 ) ag¬ 
gregates that can be bounded empirically with a statistical boot¬ 
strap, and (3) min and max. For (1), sum, count, and avg can 
all be written as sample means, sum is the sample mean scaled by 
the relation size and count is the mean of the indicator function 
scaled by the relation size. In this case, we can get analytic confi¬ 
dence intervals which allows us to analyze the efficiency tradeoffs. 
In case (2), for example median, we lose this property and have 
to use an empirical technique to bound the results. Queries such 
as min and max fall into their own category as they cannot eas¬ 
ily be bounded empirically 0 , and we discuss these queries in our 
Technical Report p^ . 

5.2.1 Confidence Intervals For Sample Means 

The first case is aggregates that can be expressed as a sample 
mean (sum, count, and avg) Sample means for uniform random 
samples (also called sampling without replacement) converge to the 
population mean by the Central Limit Theorem (CLT). Let /x be a 
sample mean calculated from k samples, be the variance of the 
sample, and p be the population mean. Then, the error (/x — /x) is 
normally distributed: A^(0, ^). Therefore, the confidence interval 
is given by: 



where 7 is the Gaussian tail probability value (e.g., 1.96 for 95%, 
2.57 for 99%). 

We discuss how to calculate this confidence interval in SQL for 
SVC-i-AQR The first step is a query rewriting step where we move 
the predicate COnd(*) into the SELECT clause (1 if true, 0 if false). 
Let attr be the aggregate attribute and m be the sampling ratio. We 
define an intermediate result trans which is a table of transformed 
rows with the first column the primary key and the second column 
defined in terms of cond (*) statement and scaling. For sum: 

trans= SELECT pk , 1 .0/m-attr-condC*) as trans.attr ERCM s 
For count: 

trans= SELECT pk, 1.0/m • cond(*) as trans_attr EROM s 

For avg since there is no scaling we do not need to re-write the 
query: 

trans= SELECT pk,attr as trans.attr FRCM s WEIE^ cond(*) 

SVC+AQP: The confidence interval on this result is defined as: 
SELECT 7 -stdev ( trans _attr ) / sqrt ( count (1) ) EROM trans 

To calculate the confidence intervals for SVC+CORR we have to 
look at the statistics of the difference, i.e., c = q{S) — q{S'), from 
a sample. If all rows in S exist in S', we could use the associativity 
of addition and subtraction to rewrite this as: c = q{S — S'), where 
— is the row-by-row difference between S and S' . The challenge 
is that the missing rows on either side make this ill-defined. Thus, 
we defined the following null-handling with a subtraction operator 
we call —. 


Definition 4 (Correspondence Subtract). Given an ag¬ 
gregate query, and two corresponding relations Ri and R 2 with 
the schema (ai, a 2 ,...) where ai is the primary key for Ri and R 2 , 
and 02 is the aggregation attribute for the query. — is defined as a 
projection of the full outer join on equality ofRi.ai = R 2 .ai: 

^Rl.a2-R2-a2{Fi IXL R 2 ) 

Null values 0 are represented as zero. 

Using this operator, we can define a new intermediate result diff: 
diff := transits')—transits) 

SVC+CORR: Then, as in SVC+AQP, we bound the result using 
the CLT: 

SELECT 7 - stdev ( Iran s _attr )/ sqrt (count (1) ) EROM diff 

5.2.2 AQP VS. CORK For Sample Means 

In terms of these bounds, we can analyze how SVC+AQP com¬ 
pares to SVC+CORR for a fixed sample size k. SVC+AQP gives 

an estimate that is proportional to the variance of the clean sample 
^2 2 
view: SVC+CORR to the variance of the differences: 

Since the change is the difference between the stale and up-to-date 
view, this can be rewritten as 

cr| + cr^/ — 2cov{S, S') 
k 

Therefore, a correction will have less variance when: 

as < 2cov{S, s') 

As we saw in the previous section, correspondence correlates the 
samples. If the difference is small, i.e., S is nearly identical to S', 
then cov{S, S') ^ a%. This result also shows that there is a point 
when updates to the stale MV are significant enough that direct 
estimates are more accurate. When we cross the break-even point 
we can switch from using SVC+CORR to SVC+AQP SVC+AQP 
does not depend on cov{S, S') which is a measure of how much 
the data has changed. Thus, we guarantee an approximation error 

of at most In our experiments (Figure j^b)), we evaluate this 
break even point empirically. 

5.2.3 Selectivity For Sample Means 

Let p be the selectivity of the query and k be the sample size; that 
is, a fraction p records from the relation satisfy the predicate. For 
these queries, we can model selectivity as a reduction of effective 
sample size k • p making the estimate variance: O(^). Thus, 
the confidence interval’s size is scaled up by Just like there 
is a tradeoff between accuracy and maintenance cost, for a fixed 
accuracy, there is also a tradeoff between answering more selective 
queries and maintenance cost. 

5.2.4 Optimality For Sample Means 

Optimality in unbiased estimation theory is defined in terms of 
the variance of the estimate GD 

Proposition 1. An estimator is called a minimum variance 
unbiased estimator (MVUE) if it is unbiased and the variance of 
the estimate is less than or equal to that of any other unbiased 
estimate. 

A sampled relation R defines a discrete distribution. It is impor¬ 
tant to note that this distribution is different from the data generat¬ 
ing distribution, since even if R has continuous valued attributes R 



still defines a discrete distribution. Our population is finite and we 
take a finite sample thus every sample takes on only a discrete set 
of values. In the general case, this distribution is only described by 
the set of all of its values (i.e., no smaller parametrized representa¬ 
tion). In this setting, the sample mean is an MVUE. In other words, 
if we make no assumptions about the underlying distribution of 
values in R, SVC-i-AQP and SVC-i-CORR are optimal for their re¬ 
spective estimates (q{S') and c). Since they estimate different vari¬ 
ables, even with optimality SVC-i-CORR might be more accurate 
than SVC-i-AQP and vice versa. There are, however, some cases 
when the assumptions, namely zero-knowledge, of this optimality 
condition do not hold. As a simple counter example, if we knew 
our data were exactly on a line, a sample size of two is sufficient 
to answer any aggregate query. However, even for many paramet¬ 
ric distributions, the sample mean estimators are still MVUEs, e.g., 
poisson, bernouilli, binomial, normal, and exponential. It is often 
difficult and unknown in many cases to derive an MVUE other than 
a sample mean. Our approach is valid for any choice of estimator 
if one exists, even though we do the analysis for sample mean esti¬ 
mators and this is the setting in which that estimator is optimal. 

5.2.5 Bootstrap Confidence Intervals 

In the second case, we explore bounding queries that cannot be 
expressed as sample means. We do not get analytic confidence in¬ 
tervals on our results, nor is it guaranteed that our estimates are 
optimal. In AQP, the commonly used technique is called a sta¬ 
tistical bootstrap 0 to empirically bound the results. In this ap¬ 
proach, we repeatedly subsample with replacement from our sam¬ 
ple and apply the query to the sample. This gives us a technique to 
bound SVC-I-AQP the details of which can be found in (3052). Eor 
SVC-I-CORR, we have to propose a variant of bootstrap to bound 
the estimate of c. In this variant, repeatedly estimate c from sub¬ 
samples and build an empirical distribution for c. 

SVC-hCORR: To use bootstrap to find a 95% confidence interval: 

1 . Subsample and Ssuh with replacement from S' and S 
respectively 

2. Apply SVC-hAQP to and 

3. Record the difference ‘{aqp{S'gub) — ciqp{Ssub)) 

4. Return to 1, for k iterations. 

5. Return the 97.5% and the 2.5% percentile of the distribution 
of results. 

6 . OUTLIER INDEXING 

Sampling is known to be sensitive to outliers (zno) . Power-laws 
and other long-tailed distributions are common in practice fT0| . 
The basic idea is that we create an index of outlier records (records 
whose attributes deviate from the mean value greatly) and ensure 
that these records are included in the sample, since these records 
greatly increase the variance of the data. 

6.1 Indices on the Base Relations 

The first step is that the user selects an attribute of any base re¬ 
lation to index and specifies a threshold t and a size limit /c. In a 
single pass of updates (without maintaining the view), the index is 
built storing references to the records with attributes greater than 
t. If the size limit is reached, the incoming record is compared to 
the smallest indexed record and if it is greater then we evict the 
smallest record. The same approach can be extended to attributes 
that have tails in both directions by making the threshold t a range, 
which takes the highest and the lowest values. However, in this 
section, we present the technique as a threshold for clarity. 


There are many approaches to select a threshold. We can use 
prior information from the base table, a calculation which can be 
done in the background during the periodic maintenance cycles. If 
our size limit is k, for a given attribute we can select the the top-k 
records with that attributes. Then, we can use that top-k list to set 
a threshold for our index. Then, the attribute value of the lowest 
record becomes the threshold t. Alternatively, we can calculate the 
variance of the attribute and set the threshold to represent c standard 
deviations above the mean. This threshold can be adaptively set at 
each maintenance period. 

6.2 Adding Outliers to the Sample 

Given this index, the next question is how we can use this infor¬ 
mation in our materialized views. We need to propagate the indices 
upwards through the expression tree. We add the condition that 
the only eligible indices are ones on base relations that are being 
sampled (i.e., we can push the hash operator down to that relation). 
Therefore, in the same iteration as sampling, we can also test the 
index threshold and add records to the outlier index. We formalize 
the propagation property recursively. Every relation can have an 
outlier index which is a set of attributes and a set of records that 
exceed the threshold value on those attributes. The main idea is 
to treat the indexed records as a sub-relation that gets propagated 
upwards with the maintenance strategy. 

Definitions (outlier index pushup). an out¬ 

lier index to be a tuple of a set of indexed attributes, and a set 
of records (/, O). The outlier index propagates upwards with the 
following rules: 

• Base Relations: Outlier indices on base relations are pushed 
up only if that relation is being sampled, i.e., if the sampling 
operator can be pushed down to that relation. 

• Push up with a new outlier index and apply the se¬ 
lection to the outliers (/, cr(/,( 0 )) 

• Push upwards (/ fl (ai,..., a^), O). 

• ^(t){ri,r 2 ) {Ri^R 2 ): Push upwards (/i U / 2 , Oi M O 2 ). 

• 7/, A (7^).' For group-by aggregates, we set I to be the aggre¬ 
gation attribute. For the outlier index, we do the following 
steps. (1) Apply the aggregation to the outlier index^f^A{0), 
(2) for all distinct A in O select the row in jf^AiR) ^ith the 
same A, and (3) this selection is the new set of outliers O. 

• Ri Ui? 2 .‘ Push up with a new outlier index (/i n/ 2 , Oi UO 2 ). 
The set of index attributes is combined with an intersection 
to avoid missed outliers. 

• Ri ni^ 2 .‘ Push up with a new outlier index (/i n/ 2 , Oi n 02 ). 

• Ri —R 2 : Push up with a new outlier index (/ 1 U/ 2 , Oi —O 2 ). 

Eor all outlier indices that can propagate to the view (i.e., the top 
of the tree), we get a final set O of records. Given these rules, O 
is, in fact, a subset of our materialized view S'. Thus, our query 
processing can take advantage of the theory described in the previ¬ 
ous section to incorporate the set O into our results. We implement 
the outlier index as an additional attribute on our sample with a 
boolean fiag true or false if it is an outlier indexed record. If a row 
is contained both in the sample and the outlier index, the outlier 
index takes precedence. This ensures that we do not double count 
the outliers. 

6.3 Query Processing 

Eor result estimation, we can think of our sample S' and our 
outlier index O as two distinct parts. Since O C S', and we give 
membership in our outlier index precedence, our sample is actu¬ 
ally a sample restricted to the set (S' — O). Eor a given query, let 






Creg be the correction calculated on {S' — O) using the technique 
proposed in the previous section and adjusting the sampling ratio 
m to account for outliers removed from the sample. We can also 
apply the technique to the outlier set O since this set is determinis¬ 
tic the sampling ratio for this set is m = 1, and we call this result 
Cout- Let N be the count of records that satisfy the query’s con¬ 
dition and I be the number of outliers that satisfy the condition. 
Then, we can merge these two corrections in the following way: 
V — ^^^Creg + j^Cout- For thc qucrics in the previous section that 
are unbiased, this approach preserves unbiasedness. Since we are 
averaging two unbiased estimates Creg and Cout, the linearity of the 
expectation operator preserves this property. Furthermore, since 
Cout is deterministic (and in fact its bias/variance is 0), Creg and 
Cout are uncorrelated making the bounds described in the previous 
section applicable as well. 

Example 7. We chose an attribute in the base data to index, 
for example duration, and an example threshold of 1.5 hours. 
We apply the rules to push the index up, and this materializes the 
entire set of rows whose duration is longer than 1.5 hours. For 
SVC-\-AQP, we run the query on the set of clean rows with durations 
longer than 1.5 hours. Then, we use the update rule in Section \6^ 
to update the result based on the number of records in the index 
and the total size of the view. For SVC+CORR, we additionally run 
the query on the set of dirty rows with durations longer than 1.5 
hours and take the difference between SVC+AQP. As in SVC-\-AQP, 
we use the update rule in Section \6^ to update the result based on 
the number of records in the index and the total size of the view. 

7. RESULTS 

We evaluate SVC first on a single node MySQL database to eval¬ 
uate its accuracy, performance, and efficiency in a variety of mate¬ 
rialized view scenarios. Then, we evaluate the outlier indexing ap¬ 
proach in terms of improved query accuracy and also evaluate the 
overhead associated with using the index. After evaluation on the 
benchmark, we present an application of server log analysis with a 
dataset from a video streaming company, Conviva. 

7.1 Experimental Setup 

Single-node Experimental Setup: Our single node experiments 
are run on a r3.large Amazon EC2 node (2x Intel Xeon E5-2670, 
15.25 GB Memory, and 32GB SSD Disk) with a MySQL ver¬ 
sion 5.6.15 database. These experiments evaluate views from a 
10GB TPCD-Skew dataset. TPCD-Skew dataset 0 is based on 
the Transaction Processing Council’s benchmark schema (TPCD) 
but is modified so that it generates a dataset with values drawn 
from a Zipfian distribution instead of uniformly. The Zipfian dis¬ 
tribution p7) is a long-tailed distribution with a single parameter 
z = {1,2,3,4} where a larger value means a more extreme tail 
and z = 1 corresponds to the basic TPCD benchmark. In our 
experiments, we use use z = 2 unless otherwise noted. The in¬ 
cremental maintenance algorithm used in our experiments is the 
“change-table” or “delta-table” method used in numerous works in 
incremental maintenance |22|23|27j . In all of the applications, the 
updates are kept in memory in a temporary table, and we discount 
this loading time from our experiments. We build an index on the 
primary keys of the view, but not on the updates. Below we de¬ 
scribe the view definitions and the queries on the view0 

Join View: In the TPCD specification, two tables receive inser¬ 
tions and updates: lineitem and orders. Out of 22 parametrized 
queries in the specification, 12 are group-by aggregates of the join 

^ Refer to our extended paper on more details about the experimental setup |29|. 


of lineitem and orders (Q3, Q4, Q5, Q7, Q8, Q9, QIO, Q12, Q14, 
Q18, Q19, Q21). Therefore, we define a materialized view of the 
foreign-key join of lineitem and orders, and compare incremental 
view maintenance and SVC. We treat the 12 group-by aggregates 
as queries on the view. 

Complex Views: Our goal is to demonstrate the applicability 
of SVC outside of simple materialized views that include nested 
queries and other more complex relational algebra. We take the 
TPCD schema and denormalize the database, and treat each of the 
22 TPCD queries as views on this denormalized schema. The 22 
TPCD queries are actually parametrized queries where parameters, 
such as the selectivity of the predicate, are randomly set by the 
TPCD qgen program. Therefore, we use the program to generate 
10 random instances of each query and use each random instance 
as a materialized view. 10 out of the 22 sets of views can benefit 
from SVC. Eor the 12 excluded views, 3 were static (i.e, there are 
no updates to the view based on the TPCD workload), and the re¬ 
maining 9 views have a small cardinality not making them suitable 
for sampling. 

Eor each of the views, we generated queries on the views. Since 
the outer queries of our views were group by aggregates, we picked 
a random attribute a from the group by clause and a random at¬ 
tribute h from aggregation. We use a to generate a predicate. 
Eor each attribute a, the domain is specified in the TPCD stan¬ 
dard. We select a random subset of this domain, e.g., if the at¬ 
tribute is country then the predicate can be countryCode > 50 and 
COUntryCode < 100. We generated 100 random sum, avg, and 
count queries for each view. 

Distributed Experimental Setup: We evaluate SVC on Apache 
Spark 1.1.0 with 1TB of logs from a video streaming company, 
Conviva Q. This is a denormalized user activity log corresponding 
to video views and various metrics such as data transfer rates, and 
latencies. Accompanying this data is a four month trace of queries 
in SQL. We identified 8 common summary statistics-type queries 
that calculated engagement and error-diagnosis metrics. These 8 
queries defined the views in our experiments. We populated these 
view definitions using the first 800GB of user activity log records. 
We then applied the remaining 200GB of user activity log records 
as the updates (i.e., in the order they arrived) in our experiments. 
We generated aggregate random queries over this view by taking 
either random time ranges or random subsets of customers. 

7.1.1 Metrics and Evaluation 

No maintenance (Stale): The baseline for evaluation is not ap¬ 
plying any maintenance to the materialized view. 

Incremental View Maintenance (IVM): We apply incremental 
view maintenance (change-table based maintenance | |22|23|27| ) to 
the full view. 

SVC+AQP: We maintain a sample of the materialized view using 
SVC and estimate the result with AQP-style estimation technique. 
SVC+CORR: We maintain a sample of the materialized view 
using SVC and process queries on the view using the correction 
which applies the AQP to both the clean and dirty samples, and 
uses both estimates to correct a stale query result. 

Since SVC has a sampling parameter, we denote a sample size 
of x% as SVC+CORR-x or SVC+AQP-x, respectively. To evaluate 
accuracy and performance, we define the following metrics: 
Relative Error: Eor a query result r and an incorrect result r', the 
relative error is ^ . When a query has multiple results (a group- 
by query), then, unless otherwise noted, relative error is defined as 
the median over all the errors. 

Maintenance Time: We define the maintenance time as the time 








needed to produce the up-to-date view for incremental view main¬ 
tenance, and the time needed to produce the up-to-date sample in 
SVC. 

7.2 Join View 

In our first experiment, we evaluate how SVC performs on a ma¬ 
terialized view of the join of lineitem and orders. We generate a 
10GB base TPCD-Skew dataset with skew z — 2, and derive the 
view from this dataset. We first generate 1GB (10% of the base 
data) of updates (insertions and updates to existing records), and 
vary the sample size. 

Performance: Figure Qa) shows the maintenance time of SVC 
as a function of sample size. With the bolded dashed line, we note 
the time for full IVM. For this materialized view, sampling allows 
for significant savings in maintenance time; albeit for approximate 
answers. While full incremental maintenance takes 56 seconds, 
SVC with a 10% sample can complete in 7.5 seconds. 

The speedup for SVC-lOIn the next figure, Figure|^b), we eval¬ 
uate this speedup. We fix the sample size to 10% and plot the 
speedup of SVC compared to IVM while varying the size of the 
updates. On the x-axis is the update size as a percentage of the 
base data. For small update sizes, the speedup is smaller, 6.5x for 
a 2.5% (250MB) update size. As the update size gets larger, SVC 
becomes more efficient, since for a 20% update size (2GB), the 
speedup is 10. lx. The super-linearity is because this view is a join 
of lineitem and orders and we assume that there is not a join index 
on the updates. Since both tables are growing sampling reduces 
computation super-linearly. 

Accuracy: At the same design point with a 10% sample, we 
evaluate the accuracy of SVC. In Figure we answer TPCD 
queries with this view. The TPCD queries are group-by aggregates 
and we plot the median relative error for SVC-i-CORR, No Main¬ 
tenance, and SVC-i-AQP. On average over all the queries, we found 
that SVC-i-CORR was 11.7x more accurate than the stale baseline, 
and 3.lx more accurate than applying SVC-i-AQP to the sample. 

SVC-hCORR vs. SVC-I-AQP: While more accurate, it is true 
that SVC-i-CORR moves some of the computation from mainte¬ 
nance to query execution. SVC-i-CORR calculates a correction to 
a query on the full materialized view. On top of the query time on 
the full view (as in IVM) there is additional time to calculate a cor¬ 
rection from a sample. On the other hand SVC-i-AQP runs a query 
only on the sample of the view. We evaluate this overhead in Fig¬ 
ure Ha), where we compare the total maintenance time and query 
execution time. For a 10% sample SVC-i-CORR required 2.69 secs 
to execute a sum over the whole view, IVM required 2.45 secs, and 
SVC-i-AQP required 0.25 secs. However, when we compare this 
overhead to the savings in maintenance time it is small. 

SVC-I-CORR is most accurate when the materialized view is less 
stale as predicted by our analysis in Section |5.2.2| On the other 
hand SVC-i-AQP is more robust to the staleness and gives a consis¬ 
tent relative error. The error for SVC-i-CORR grows proportional 
to the staleness. In Figurej^b), we explore which query processing 
technique, SVC-i-CORR or SVC-i-AQP, should be used. For a 10% 
sample, we find that SVC-i-CORR is more accurate until the update 
size is 32.5% of the base data. 

7.3 Complex Views 

In this experiment, we demonstrate the breadth of views sup¬ 
ported by SVC by using the TPCD queries as materialized views. 
We generate a 10GB base TPCD-Skew dataset with skew z = 2, 
and derive the views from this dataset. We first generate 1GB (10% 
of the base data) of updates (insertions and updates to existing 
records), and vary the sample size. Figure|^shows the maintenance 
time for a 10% sample compared to the full view. This experiment 




Figure 4: (a) On a 10GB view with 1GB of insertions and up¬ 
dates, we vary the sampling ratio and measure the maintenance 
time of SVC. (b) For a fixed sampling ratio of 10%, we vary the 
update size and plot the speedup compared to full incremental 
maintenance. 



Figure 5: For a fixed sampling ratio of 10% and update size 
of 10% (1GB), we generate 100 of each TPCD parameterized 
queries and answer the queries using the stale materialized 
view, SVC-hCORR, and SVC-hAQP. We plot the median rela¬ 
tive error for each query. 

illustrates how the view definitions plays a role in the efficiency 
of our approach. For the last two views, V21 and V22, we see 
that sampling does not lead to as large of speedup indicated in our 
previous experiments. This is because both of those views contain 
nested structures which block the pushdown of hashing. V21 con¬ 
tains a subquery in its predicate that does not involve the primary 
key, but still requires a scan of the base relation to evaluate. V22 
contains a string transformation of a key blocking the push down. 
These results are consistent with our previous experiments showing 
that SVC is faster than IVM and more accurate than SVC-i-AQP and 
no maintenance. 

7.4 Outlier Indexing 

In our next experiment, we evaluate our outlier indexing with 
the top-k strategy described in Section In this setting, outlier 
indexing significantly helps for both SVC-i-AQP and SVC-i-CORR. 
We index the Lextendedprice attribute in the lineitem table. We 
evaluate the outlier index on the complex TPCD views. We find that 
four views: V3,V5,V10, V15, can benefit from this index with our 
push-up rules. These are four views dependent on Lextendedprice 
that were also in the set of “Complex” views chosen before. 

In our first outlier indexing experiment (Figurej^a)), we analyze 
V3. We set an index of 100 records, and applied SVC-i-CORR and 
SVC-I-AQP to views derived from a dataset with a skew parameter 
z = {1, 2,3,4}. We run the same queries as before, but this time 
we measure the error at the 75% quartile. We find in the most 
skewed data SVC with outlier indexing reduces query error by a 
factor of 2. Next, in Figure |^(b), we plot the overhead for outlier 
indexing for V3 with an index size of 0, 10, 100, and 1000. While 
there is an overhead, it is still small compared to the gains made 
by sampling the maintenance strategy. We note that none of the 
prior experiments used an outlier index. The caveat is that these 
experiments were done with moderately skewed data with Zipfian 
parameter = 2, if this parameter is set to 4 then the 75% quartile 
query estimation error is nearly 20% (Figure |^). Outlier indexing 
always improves query results as we are reducing the variance of 
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Figure 6: (a) For a fixed sampling ratio of 10% and update size 
of 10% (1GB), we measure the total time incremental mainte¬ 
nance + query time, (b) SVC+CORR is more accurate than 
SVC+AQP until a break even point. 
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Figure 8: (a) For one view V3 and 1GB of updates, we plot 
the 75% quartile error with different techniques as we vary 
the skewness of the data, (b) While the outlier index adds an 
overhead this is small relative to the total maintenance time. 
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Figure 7: (a) For 1GB update size, we compare maintenance 
time and accuracy of SVC with a 10% sample on different 
views. V21 and V22 do not benefit as much from SVC due to 
nested query structures, (b) For a 10% sample size and 10% 
update size, SVC+CORR is more accurate than SVC+AQP and 
No Maintenance. 

the estimation set, however, this reduction in variance is largest 
when there is a longer tail. 

7.5 Conviva 

We derive the views from 800GB of base data and add 80GB 
of updates. These views are stored and maintained using Apache 
Spark in a distributed environment. The goal of this experiment 
is to evaluate how SVC performs in a real world scenario with a 
real dataset and a distributed architecture. In Figure [^a), we show 
that on average over all the views, SVC-10% gives a 7.5x speedup. 
For one of the views full incremental maintenance takes nearly 800 
seconds, even on a 10-node cluster, which is a very significant cost. 
In Figure l^b), we show that SVC also gives highly accurate re¬ 
sults with an average error of 0.98%. These results show consis¬ 
tency with our results on the synthetic datasets. This experiment 
highlights a few salient benefits of SVC: (1) sampling is a rela¬ 
tively cheap operation and the relative speedups in a single node 
and distributed environment are similar, (2) for analytic workloads 
like Conviva (i.e., user engagement analysis) a 10% sample gives 
results with 99% accuracy, and (3) savings are still significant in 
systems like Spark that do not support selective updates. 

7.6 Additional Experiments 

7.6.1 Aggregate View 

In our next experiment, we evaluate an aggregate view use case 
similar to a data cube. We generate a 10GB base TPCD dataset with 


Figure 9: (a) We compare the maintenance time of SVC with 
a 10% sample and full incremental maintenance, and find that 
as with TPCD SVC saves significant maintenance time, (b) We 
also evaluate the accuracy of the estimation techniques. 

skew z — 1, and derive the base cube as a materialized view from 
this dataset. We add 1GB of updates and apply SVC to estimate the 
results of all of the “roll-up” dimensions. 

Performance: We observed the same trade-off as the previous 
experiment where sampling significantly reduces the maintenance 
time (Figure a)). It takes 186 seconds to maintain the entire 
view, but a 10% sample can be maintained in 26 seconds. As be¬ 
fore, we fix the sample size at 10% and vary the update size. We 
similarly observe that SVC becomes more efficient as the update 
size grows (Figurep^b)), and at an update size of 20% the speedup 
is 8.7x. 

Accuracy: In Figure [TT] we measure the accuracy of each of 
the “roll-up” aggregate queries on this view. That is, we take each 
dimension and aggregate over the dimension. We fix the sample 
size at 10% and the update size at 10%. On average SVC+Corr is 
12.9x more accurate than the stale baseline and 3.6x more accurate 
than SVC+AQP (Figure[T^c)). 

Since the data cubing operation is primarily constructed by 
group-by aggregates, we can also measure the max error for each 
of the aggregates. We see that while the median staleness is close 
to 10%, for some queries some of the group aggregates have nearly 
80% error (PigurefTT). SVC greatly mitigates this error to less than 
12% for all queries. 

Other Queries: Finally, we also use the data cube to illus¬ 
trate how SVC can support a broader range of queries outside of 
sum, count, and avg. We change all of the roll-up queries to 
use the median function (Figure pj]). First, both SVC+Corr and 
SVC+AQP are more accurate as estimating the median than they 
were for estimating sums. This is because the median is less sensi¬ 
tive to variance in the data. 
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Figure 10: (a) In the aggregate view case, sampling can save 
significant maintenance time, (b) As the update size grows SVC 
tends towards an ideal speedup of lOx. 
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Figure 11: We measure the accuracy of each of the roll-up 
aggregate queries on this view. For a 10% sample size and 
10% update size, we find that SVC+Corr is more accurate than 
SVC+AQP and No Maintenance. 


Figure 12: For 1GB of updates, we plot the max error as op¬ 
posed to the median error in the previous experiments. Even 
though updates are 10% of the dataset size, some queries are 
nearly 80% incorrect. SVC helps significantly mitigate this er¬ 
ror. 



Figure 13: We run the same experiment but replace the sum 
query with a median query. We find that similarly SVC is more 
accurate. 


7. 6.2 Mini-batch Experiments 

We devised an end-to-end experiment simulating a real integra¬ 
tion with periodic maintenance. However, unlike the MySQL case, 
Apache Spark does not support selective updates and insertions as 
the “views” are immutable. A further point is that the immutability 
of these views and Spark’s fault-tolerance requires that the “views” 
are maintained synchronously. Thus, to avoid these significant 
overheads, we have to update these views in batches. Spark does 
have a streaming variant (51), however, this does not support the 
complex SQL derived materialized views used in this paper, and 
still relies on mini-batch updates. 

SVC and IVM will run in separate threads each with their own 
RDD materialized view. In this application, both SVC and IVM 
maintain respective their RDDs with batch updates. In this model, 
there are a lot of different parameters: batch size for periodic main¬ 
tenance, batch size for SVC, sampling ratio for SVC, and the fact 
that concurrent threads may reduce overall throughput. Our goal 
is to fix the throughput of the cluster, and then measure whether 
SVC+IVM or IVM alone leads to more accurate query answers. 

Batch sizes: In Spark, larger batch sizes amortize overheads bet¬ 
ter. In Figure p^a), we show a trade-off between batch size and 
throughput of Spark for V2 and V5. Throughputs for small batches 
are nearly lOx smaller than the throughputs for the larger batches. 

Concurrent SVC and IVM: Next, we measure the reduction 
in throughput when running multiple threads. We run SVC-10 in 
loop in one thread and IVM in another. We measure the reduction 
in throughput for the cluster from the previous batch size experi¬ 
ment. In Figure p^b), we plot the throughput against batch size 
when two maintenance threads are running. While for small batch 
sizes the throughput of the cluster is reduced by nearly a factor of 
2, for larger sizes the reduction is smaller. As we found in later 
experiments (Figure [^, larger batch sizes are more amenable to 
parallel computation since there was more idle CPU time. 

Choosing a Batch Size: The results in Figure [^a) and Fig¬ 
ure [^b) show that larger batch sizes are more efficient, however. 


larger batch sizes also lead to more staleness. Combining the re¬ 
sults in Figure [^a) and Figure [^b), for both SVC+IVM and 
IVM, we get cluster throughput as a function of batch size. For 
a fixed throughput, we want to find the smallest batch size that 
achieves that throughput for both. For V2, we fixed this at 700,000 
records/sec and for V5 this was 500,000 records/sec. For IVM 
alone the smallest batch size that met this throughput demand was 
40GB for both V2 and V5. And for SVC+IVM, the smallest batch 
size was 80GB for V2 and 100GB for V5. When running periodic 
maintenance alone view updates can be more frequent, and when 
run in conjunction with SVC it is less frequent. 

We run both of these approaches in a continuous loop, 
SVC+IVM and IVM, and measure their maximal error during a 
maintenance period. There is further a trade-off with the sam¬ 
pling ratio, larger samples give more accurate estimates however 
between SVC batches they go stale. We quantify the error in these 
approaches with the max error; that is the maximum error in a 
maintenance period (Figure p~5]). These competing objective lead 
to an optimal sampling ratio of 3% for V2 and 6% for V5. At this 
sampling point, we find that applying SVC gives results 2.8x more 
accurate for V2 and 2x more accurate for V5. 

To give some intuition on why SVC gives more accurate results, 
in Figure we plot the average CPU utilization of the cluster 
for both periodic IVM and SVC+periodic IVM. We find that SVC 
takes advantage of the idle times in the system; which are common 
during shuffle operations in a synchronous parallelism model. 

In a way, these experiments present a worst-case application for 
SVC, yet it still gives improvements in terms of query accuracy. In 
many typical deployments throughput demands are variable forcing 
maintenance periods to be longer, e.g., nightly. The same way that 
SVC takes advantage of micro idle times during communication 
steps, it can provide large gains during controlled idle times when 
no maintenance is going on concurrently. 

8 . RELATED WORK 
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Figure 14: (a) Spark RDDs are most efficient when updated 
in batches. As batch sizes increase the system throughput in¬ 
creases. (b) When running multiple threads, the throughput 
reduces. However, larger batches are less affected by this re¬ 
duction. 


Max Error V2 


5 % 
5 4 % 



0 % ^- 

0 0.05 0.1 0.15 0.2 

Sampiing Ratio 


Max Error V5 

8% 



0 % ^ - 

0 0.05 0.1 0.15 0.2 

Sampiing Ratio 


Figure 15: For a fixed throughput, SVC+Periodic Maintenance 
gives more accurate results for V2 and V5. 


Addressing the cost of materialized view maintenance is the sub¬ 
ject of many recent papers, which focus on various perspectives 
including complex analytical queries transactions Q, real¬ 
time analytics p4) , and physical design d). The increased re¬ 
search focus parallels a major concern in industrial systems for 
incrementally updating pre-computed results and indices such as 
Google Percolator | [4^ and Twitter’s Rainbird | [4^ . The stream¬ 
ing community has also studied the view maintenance problem 
|2|18|20|24|28) . In Spark Streaming, Zaharia et al. studied how 
they could exploit in-memory materialization pT| , and in Mon- 
etDB, Liarou et al. studied how ideas from columnar storage can 
be applied to enable real-time analytics p4) . These works focus on 
correctness, consistency, and fault tolerance of materialized view 
maintenance. SVC proposes an alternative model for view main¬ 
tenance where we allow approximation error (with guarantees) for 
queries on materialized views for vastly reduced maintenance time. 
In many decision problems, exact results are not needed as long as 
the probability of error is boundable. Sampling has been well stud¬ 
ied in the context of query processing mm- Both the prob¬ 
lems of efficiently sampling relations m and processing complex 
queries ||^, have been well studied. In SVC, we look at a new 
problem, where we efficiently sample from a maintenance strategy, 
a relational expression that updates a materialized view. We gen¬ 
eralize uniform sampling procedures to work in this new context 
using lineage E) and hashing. We look the problem of approxi¬ 
mate query processing | |3|4| from a different perspective by estimat¬ 
ing a “correction” rather than estimating query results. Srinivasan 
and Carey studied a problem related to query correction which they 
called compensation-based query processing | [46) for concurrency 
control but did not study this for sampled estimates. This work was 
applied in the context of concurrency control. However, this work 
did not consider applications when the correction was applied to a 
sample as in SVC. The sampling in SVC introduces new challenges 
such as sensitivity to outliers, questions of bias, and estimate opti¬ 
mality. 


Figure 16: SVC better utilizes idle times in the cluster by main¬ 
taining the sample. 


Sampling has also been studied from the perspective of main¬ 
taining samples | [4^ . In p6) , Joshi and Jermaine studied indexed 
materialized views that are amenable to random sampling. While 
similar in spirit (queries on the view are approximate), the goal 
of this work was to optimize query processing and not to address 
the cost of incremental maintenance. There has been work using 
sampled views in a limited context of cardinality estimation p0| , 
which is the special case of our framework, namely, the count 
query. Nirkhiwale et al. p^ , studied an algebra for estimating con¬ 
fidence intervals in aggregate queries. The objective of this work 
is not sampling efficiency, as in SVC, but estimation. As a special 
case, where we consider only views constructed from select and 
project operators, SVC’s hash pushdown will yield the same results 
as their model. There has been theoretical work on the maintenance 
of approximate histograms, synopses, and sketches |12|19| , which 
closely resemble aggregate materialized views. The objectives of 
this work (including techniques such as sketching and approximate 
counting) have been to reduce the required storage, not to reduce 
the required update time. 

Meliou et al. proposed a technique to trace errors in an MV 
to base data and find responsible erroneous tuples. They do not, 
however, propose a technique to correct the errors as in SVC. Cor¬ 
recting general errors as in Meliou et al. is a hard constraint sat¬ 
isfaction problem. However, in SVC, through our formalization of 
staleness, we have a model of how updates to the base data (mod¬ 
eled as errors) affect MVs, which allows us to both trace errors and 
clean them. Wu and Madden | [50) did propose a model to correct 
“outliers” in an MV through deletion of records in the base data. 
This is a more restricted model of data cleaning than SVC, where 
the authors only consider changes to existing rows in an MV (no 
insertion or deletion) and do not handle the same generality of re¬ 
lational expressions (e.g., nested aggregates). Challamalla et al. |[^ 
proposed an approximate technique for specifying errors as con¬ 
straints on a materialized view and proposing changes to the base 
data such that these constraints can be satisfied. While comple¬ 
mentary, one major difference between the three works |6|36|50| 
and SVC is that they require an explicit specification of erroneous 
rows in a materialized view. Identifying whether a row is erroneous 
requires materialization and thus specifying the errors is equivalent 
to full incremental maintenance. We use the formalism of a “main¬ 
tenance strategy”, the relational expression that updates the view, 
to allow us to sample rows that are not yet materialized. However, 
while not directly applicable for staleness, we see SVC as com¬ 
plementary to these works in the dirty data setting. The sampling 
technique proposed in Section 4 of our paper could be used to ap¬ 
proximate the data cleaning techniques in |6|36|50| and this is an 
exciting avenue of future work. 

Sampling has been explored in the streaming community, and a 
similar idea of sampling from incoming updates has also been ap¬ 
plied in stream processing |16|44|47| . While some of these works 
studied problems similar to materialization, for example, the Jet- 















































Stream project (Rabkin et al.) looks at how sampling can help 
with real-time analysis of aggregates. None of these works for¬ 
mally studied the class views that can benefit from sampling or 
formalized queries on these views. However, there are ideas from 
Rabkin et al. that could be applied in SVC in future work, for ex¬ 
ample, their description of coarsening operations in aggregates is 
very similar to our experiments with the “roll-up” queries in aggre¬ 
gate views. There are a variety of other efforts proposing storage 
efficient processing of aggregate queries on streams Gim) which 
is similar to our problem setting and motivation. 

9. LIMITATIONS AND OPPORTUNITIES 

While our experiments show that SVC works for a variety of ap¬ 
plications, there are a few limitations which we summarize in this 
section. There are two primary limitations for SVC: class of queries 
and types of materialized views. In this work, we primarily fo¬ 
cused on aggregate queries and showed that accuracy decreases as 
the selectivity of the query increases. Sampled-based methods are 
fundamentally limited in the way they can support “point lookup” 
queries that select a single row. This is predicted by our theoreti¬ 
cal result that accuracy decreases with ^ where p is the fraction of 
rows that satisfy the predicate. In terms of more view definitions, 
SVC does not support views with ordering or “top-k” clauses, as 
our sampling assumes no ordering on the rows of the MV and it 
is not clear how sampling commutes with general ordering oper¬ 
ations. In the future, we will explore maintenance optimizations 
proposed in recent work. For example, DBToaster has two main 
components, higher-order delta processing and a SQL query com¬ 
piler, both of which are complementary to SVC. SVC proposes a 
new approach for accurate query processing with MVs. Our results 
are promising and suggest many avenues for future work. In par¬ 
ticular, we are interested in deeper exploration of the multiple MV 
setting. There are many interesting design problems such as given 
storage constraints and throughput demands, optimize sampling ra¬ 
tios over all views. Furthermore, there is an interesting challenge 
about queries that join mutliple sample MVs managed by SVC. 
We are also interested in the possibility of sharing computation be¬ 
tween MVs and maintenance on views derived from other views. 
Finally, our results suggest relatively a straight forward implemen¬ 
tation of adaptive selection of the parameters in SVC such as the 
view sampling ratio and the outlier index threshold. 

10. CONCLUSION 

Materialized view maintenance is often expensive, and in prac¬ 
tice, eager view maintenance is often avoided due to its costs. This 
leads to stale materialized views which have incorrect, missing, and 
superfiuous rows. In this work, we formalize the problem of stale¬ 
ness and view maintenance as a data cleaning problem. SVC uses a 
sample-based data cleaning approach to get accurate query results 
that refiect the most recent data for a greatly reduced computational 
cost. To achieve this, we significantly extended our prior work 
in data cleaning, SampleClean 148) , for efficient cleaning of stale 
MVs. This included processing a wider set of aggregate queries, 
handling missing data errors, and proving for which queries opti¬ 
mality of the estimates hold. We presented both empirical and the¬ 
oretical results showing that our sample data cleaning approach is 
significantly less expensive than full view maintenance for a large 
class of materialized views, while still providing accurate aggregate 
query answers that reflect the most recent data. 

Our results are promising and suggest many avenues for future 
work. In this work, we focused on aggregate queries and showed 
that accuracy decreases as the selectivity of the query increases. 
Sampled-based methods are fundamentally limited in the way they 
can support “point lookup” queries that select a single row, and 


we believe we can address this problem with new results in non- 
parametric machine learning instead of using single-parameter esti¬ 
mators. In particular, we are interested in deeper exploration of the 
multiple MV setting. There are also many interesting design prob¬ 
lems such as given storage constraints and throughput demands, 
optimize sampling ratios over all views. 
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12. APPENDIX 
12.1 Extensions 


12.1.1 MIN and MAX 

min and max fall into their own category since this is a canoni¬ 
cal case where bootstrap fails. We devise an estimation procedure 
that corrects these queries. However, we can only achieve bound 
that has a slightly different interpretation than the confidence inter¬ 
vals seen before. We can calculate the probability that a larger (or 
smaller) element exists in the unsampled view. 

We devise the following correction estimate for max: (1) For all 
rows in both S and S', calculate the row-by-row difference, (2) let 
c be the max difference, and (3) add c to the max of the stale view. 

We can give weak bounds on the results using Cantelli’s Inequal¬ 
ity. If X is a random variable with mean ptx and variance var{X), 
then the probability that X is larger than a constant e 


P(JA > e + p.x) < 


var{X) 
var{X) + 


Therefore, if we set e to be the difference between max value es¬ 
timate and the average value, we can calculate the probability that 
we will see a higher value. 

The same estimator can be modified for min, with a correspond¬ 
ing bound: 


P(JY < iLx - a)) < 


var(x) 
var{x) + 


This bound has a slightly different interpretation than the confi¬ 
dence intervals seen before. This gives the probability that a larger 
(or smaller) element exists in the unsampled view. 

12.1.2 Select Queries 

In SVC, we also explore how to extend this correction procedure 
to Select queries. Suppose, we have a Select query with a predicate: 


SELECT * EROM View WHE^ Condition (A) ; 


We first run the Select query on the stale view, and this returns a 
set of rows. This result has three types of data error: rows that are 
missing, rows that are falsely included, and rows whose values are 
incorrect. 

As in the sum, count, and avg query case, we can apply the 
query to the sample of the up-to-date view. From this sample, us¬ 
ing our lineage defined earlier, we can quickly identify which rows 
were added, updated, and deleted. For the updated rows in the sam¬ 
ple, we overwrite the out-of-date rows in the stale query result. For 
the new rows, we take a union of the sampled selection and the up¬ 
dated stale selection. For the missing rows, we remove them from 
the stale selection. To quantify the approximation error, we can 
rewrite the Select query as count to get an estimate of number of 




rows that were updated, added, or deleted (thus three “confidence” 
intervals). 

12.2 Extended Proofs 

12.3 Is Hashing Equivalent To RNG? 

In this work, we argue that hashing can be used for “sampling” 
a relational expression. However, from a complexity theory per¬ 
spective, hashing is not equivalent to random number generation 
(RNG). The existence of true one-way hash functions is a conjec¬ 
ture that would imply P ^ NP. This conjecture is often taken as 
an assumption in Cryptography. Of course, the ideal one-way hash 
functions required by the theory do not exist in practice. How¬ 
ever, we find that existing hashes (e.g., linear hashes and SHAl) 
are sufficiently close to ideal that they can still take advantage of 
this theory. On the other hand, a SHAl hash is nearly an order of 
magnitude slower but is much more uniform. This assumption is 
called the Simple Uniform Hashing Assumption (SUHA) fTT| , and 
is widely used to analyze the performance of hash tables and hash 
partitioning. There is an interesting tradeoff between the latency in 
computing a hash compared to its uniformity. For example, a lin¬ 
ear hash stored procedure in MySQL is nearly as fast pseudoran¬ 
dom number generation that would be used in a TABLES AMPLE 
operator, however this hash exhibits some non-uniformity. 

12.3.1 Hashing and Correspondence 

A benefit of deterministic hashing is that when applied in con¬ 
junction to the primary keys of a view, we get the Correspondence 
Property (Definitionfor free. 

Proposition 2 (Hashing Correspondence). Suppose 
we have S which is the stale view and S' which is the up-to-date 
view. Both these views have the same schema and a primary key 
a. Let r]a,m be our hash function that applies the hashing to the 
primary key a. 

S = r]a,m{3) 

Then, two samples S' and S correspond. 

Prooe. There are four conditions for correspondence: 

• (1) Uniformity: S' and S are uniform random samples of S' 
and S respectively with a sampling ratio of m 

• (2) Removal of Superfluous Rows: D = {Vs ^ S$s' ^ S' \ 
s{u) = s'(u)}, D S' — ^ 

• (3) Sampling of Missing Rows: I — {Vs' ^ S'^s ^ S \ 
s{u) — s'(u)}, E(| lr\S'\)=m\I\ 

• (4) Key Preservation for Updated Rows: Eor all s G & and 
not in D or /, s' ^ S' : s'{u) = s(u). 

Uniformity is satisfied under by definition under SUHA (Simple 
Uniform Hashing Assumption). Condition 2 is satisfied since if r 
is deleted, then r ^ S' which implies that r ^ S'. Condition 3 is 
just the converse of 2 so it is satisfied. Condition 4 is satisfied since 
if r is in then it was sampled, and then since the primary key is 
consistent between S and S' it will also be sampled in S'. □ 

12.4 Theorem 1 Proof 

Theorem 2. Given a derived relation R, primary key a, and 
the sample rja,m{R)- Let S be the sample created by applying rja,m 
without push down and S' be the sample created by applying the 
push down rules to rja,m{R)- S and S' are identical samples with 
sampling ratio m. 


Prooe. We can prove this by induction. The base case is where 
the expression tree is only one node, trivially making this true. 
Then, we can induct considering one level of operators in the tree, 
cr, U, n, — clearly commute with hashing the key a allowing for 
push down. H commutes only if a is in the projection. Eor M, a 
sampling operator on Q can be pushed down if a is in either kr 
or ks, or if there is a constraint that links kr to kg. There are two 
cases in which this happens a foreign-key relationship or an equal¬ 
ity join on the same key. Eor group by aggregates, if a is in the 
group clause (i.e., it is in the aggregate) then a hash of the operand 
filters all rows that have a which is sufficient to materialize the de¬ 
rived row. It is provably NP-Hard to pushdown through a nested 
group by aggregate such as: 

SELECT c, count(1) 

FKOM ( 

SELECT videold , sum(l) as c ERCM Log 
GROUP BY videold 

) 

GROUP BY c 

by reduction to a SUB SET-SUM problem. □ 

12.5 More about the Hash Operator 

We defined a concept of tuple-lineage with primary keys. How¬ 
ever, a curious property of the deterministic hashing technique is 
that we can actually hash any attribute while retain the important 
statistical properties. This is because a uniformly random sample 
of any attribute (possibly not unique) still includes every individ¬ 
ual row with the same probability. A consequence of this is that 
we can push down the hashing operator through arbitrary equality 
joins (not just many-to-one) by hashing the join key. 

We defer further exploration of this property to future work as it 
introduces new tradeoffs. Eor example, sampling on a non-unique 
key, while unbiased in expectation, has higher variance in the size 
of the sample. Happening to hash a large group may lead to de¬ 
creased performance. 

Suppose our keys are duplicated pk times on average with vari¬ 
ance cTfc, then the variance of the sample size is for sampling frac¬ 
tion m: 

m{l — m)nk + (1 “ Tn)crl 

This equation is derived from the formula for the variance of a mix¬ 
ture distribution. In this setting, our sampling would have to con¬ 
sider this variance against the benefits of pushing the hash operator 
further down the query tree. 

12.6 Experimental Details 

12.6.1 Join View TPCD Queries 

In our first experiment, we materialize the join of lineitem and 
orders. We treat the TPCD queries as queries on the view, and we 
selected 12 out of the 22 to include in our experiments. The other 
10 queries did not make use of the join. 

12.6.2 Conviva Views 

In this workload, there were annotated summary statistics 
queries, and we filtered for the most common types. While, we 
cannot give the details of the queries, we can present some of the 
high-level characteristics of 8 summary-statistics type views. 

• VI. Counts of various error types grouped by resources, 
users, date 

• V2. Sum of bytes transferred grouped by resource, users, date 

• V3. Counts of visits grouped by an expression of resource 
tags, users, date. 


• V4. Nested query that groups users from similar regions/ser¬ 
vice providers together then aggregates statistics 

• V5. Nested query that groups users from similar regions/ser¬ 
vice providers together then aggregates error types 

• V6. Union query that is filtered on a subset of resources and 
aggregates visits and bytes transferred 

• V7. Aggregate network statistics group by resources, users, 
date with many aggregates. 

• V8. Aggregate visit statistics group by resources, users, date 
with many aggregates. 

12.6.3 Data Cube Specification 

We defined the base cube as a materialized view: 
select 

sum( 1 _extendedprice * (1 — l_discount)) as revenue, 
c-custkey , n.nationkey , 
r_regionkey , L_PARTKEY 

from 

lineitem , orders , 
customer , nation , 
region 

where 

l-orderkey = o.orderkey and 
0_CUSTKEY = c_custkey and 
c.nationkey = n.nationkey and 
N_REGIONKEY = r_regionkey 

group by 

c_custkey , n_nationkey , 
r.regionkey , L_PARTKEY 

Each of queries was an aggregate over subsets of the dimensions 
of the cube, with a sum over the revenue column. 

• Ql. all 

• Q2. c_custkey 

• Q3. n_nationkey 

• Q4. r_regionkey 

• Q5. Lpartkey 

• Q6. c_custkey,n_nationkey 

• Q7. c_custkey,r_regionkey 

• Q8. c_custkey,Lpartkey 

• Q9. n_nationkey, r_regionkey 

• QIO. n_nationkey, Lpartkey 

• Qll. c_custkey,n_nationkey, r_regionkey 

• Q12. c_custkey,n_nationkey,l_partkey 

• Q13. n_nationkey,r_regionkey,Lpartkey 

When we experimented with the median query, we changed the 
sum to a median of the revenues. 

12.6.4 Table Of TP CD Queries 2 

We denormalize the TPCD schema and treat each of the 22 
queries as views on the denormalized schema. In our experiments, 
we evaluate 10 of these with SVC. Here, we provide a table of the 
queries and reasons why a query was not suitable for our exper¬ 
iments. The main reason a query was not used was because the 
cardinality of the result was small. Since we sample from the view, 
if the result was small eg. j 10, it would not make sense to apply 
SVC. Furthermore, in the TPCD specification the only tables that 
are affected by updates are lineitem and orders; and queries that do 
not depend on these tables do not change; thus there is no need for 
maintenance. 

Listed below are excluded queries and reasons for their exclu- 


• Query 1. Result cardinality too small 

• Query 2. The query was static 

• Query 6. Result cardinality too small 

• Query 7. Result cardinality too small 

• Query 8. Result cardinality too small 

• Query 11. The query was static 

• Query 12. Result cardinality too small 

• Query 14. Result cardinality too small 

• Query 15. The query contains an inner query, which we treat 
as a view. 

• Query 16. The query was static 

• Query 17. Result cardinality too small 

• Query 19. Result cardinality too small 

• Query 20. Result cardinality too small 


Sion. 




